Getting Set-up

Installing and Setting up RStudio

The R console looks like this:

File Organization

Make sure that you set up a folder for this class.

Using RMarkdown/knitr

You can knit the file. The first time you do this, you will need to make sure the knitr package is installed. You have the option to knit to .html, .pdf, or Word (.docx). In general, in this course we will be knitting to .html.

RMarkdown formatting

To make something “code-looking”, wrap it in grave accents (backticks). The ` key is found in the upper left of your keyboard.

To create a header, place a hash symbol (#) followed by a space at the start of the line. For example, # Header 1 creates a level 1 header and ## Header Level 2 creates a level 2 header.

To italicize text, put one asterisk on each side of the text *like this*. To make text bold, put two asterisks on each side of the text **like this**.

To make a list, just start creating your list using a - or * for each bullet, like this:

- list item 1
- list item 2

It is important that there is a blank line before the first bullet.

Add a link with the following code (include the https:// so the link is not treated as a relative path):

[Alt text that will display](https://www.google.com)

It will display like this:

Alt text that will display

Add an image with the following code:

![Alt text](https://raw.githubusercontent.com/allisonhorst/stats-illustrations/master/rstats-artwork/rmarkdown_wizards.png)

It will display like this:

Alt text

Most Markdown syntax is covered in the RStudio RMarkdown Cheatsheet, Section 3.

R Chunks

Create an R chunk:

2+2
## [1] 4

OR

x<-4

echo=T or echo=F – determines whether or not to echo the source code in the output file. Setting echo=F can be useful if you are creating a document for someone who doesn’t need or doesn’t want to see your code, just the output. In general, for assignments in this course I would like your code to be echoed. The default is echo=T.

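results=T or results=F – determines whether or not the results will be displayed. Setting results=F can be useful if you want to show code, but don’t care what the output is. By default, results are displayed (in knitr's own terms, results='markup' shows text output and results='hide' hides it).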

eval=T or eval=F – determines whether or not to evaluate the code. Setting eval=F can be useful if you have a whole chunk of code you don’t want run, but you also don’t want to delete. The default is eval=T.

There are many, many more options including fig.width, fig.height, cache, etc. The vast majority of options are available in the RStudio RMarkdown Cheatsheet, Section 5.

You can set options individually on each chunk and/or set global options by using the code knitr::opts_chunk$set(your options here) in the first code chunk.
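
As a rough sketch of what that first chunk might contain (the option values below are illustrative assumptions, not course requirements, and the knitr package must already be installed):

```r
# Global chunk options; individual chunks can still override these.
# The specific values here are only examples.
knitr::opts_chunk$set(
  echo = TRUE,      # show the source code in the knitted output
  fig.width = 6,    # default figure width, in inches
  fig.height = 4    # default figure height, in inches
)

# Check what is currently set for one option
knitr::opts_chunk$get("echo")
```

Any option set directly on a chunk takes precedence over the global value for that chunk.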

Inline Code

Rather than using a code chunk (which is displayed as its own block on the page), you also have the option to use inline code. You can place the following within any sentence or paragraph.

`r codehere`

For example,

This is the number `r x`.

becomes… This is the number 4.

Installing Packages

Packages can contain lots of things including: data sets, functions, etc.

You can install packages using the packages tab or you can use the code install.packages('packageyouwant') in the console.

In each new R session where you want to use the package you will have to load it by typing library('packageyouwant') in the console (or in the RMarkdown document - more later).

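To get help with a function in a package, you can type ?functionname into the console. For an overview of a whole package, try help(package = 'packagename'); ?packagename works only when the package provides a package-level help page.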
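
Putting those steps together, a minimal sketch might look like the following. The package name knitr is used only as an example, and the CRAN mirror URL is one common choice rather than a requirement.

```r
# "knitr" is just an example package name.
pkg <- "knitr"

# Install only if the package is not already available (needs internet access).
if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg, repos = "https://cloud.r-project.org")
}

# Load the package for this session.
# character.only = TRUE lets us pass the package name as a string variable.
library(pkg, character.only = TRUE)

# Help for a single function (the same as typing ?mean in the console)
help("mean")
```

Installing is needed once per computer, while library() is needed in every new R session.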

Additional Reading (Optional)

Some Basic R code

Variables, Calculations, Vectors

Assigning Variables:

x <- 2+2
y <- 6

Calculations:

x/2
## [1] 2
x*2
## [1] 8
x+y
## [1] 10

Vectors:

#c() function: concatenate
vector1 <- c(1,2,9,15,1000)

Referencing Elements of a Vector:

vector1[1]
## [1] 1

Functions:

mean(vector1)
## [1] 205.4

If you are ever unsure about a function, you can type ?functionname into the console. In this case, ?mean.
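
A few more things you can do with the same vector, shown here as a short sketch using only base R functions:

```r
vector1 <- c(1, 2, 9, 15, 1000)

vector1[c(1, 3)]   # first and third elements: 1 9
vector1[-1]        # everything except the first element
vector1 * 2        # arithmetic is applied to every element
length(vector1)    # number of elements: 5
median(vector1)    # 9, far less affected by the 1000 than the mean (205.4)
```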

Importing Data

For now, we will mostly be working with .csv and .xls files. Later in the course, we may discuss other types of files.

From a file on your computer:

From a package:

library("openintro")
## Please visit openintro.org for free statistics materials
## 
## Attaching package: 'openintro'
## The following objects are masked from 'package:datasets':
## 
##     cars, trees
cars
##       type price mpgCity driveTrain passengers weight
## 1    small  15.9      25      front          5   2705
## 2  midsize  33.9      18      front          5   3560
## 3  midsize  37.7      19      front          6   3405
## 4  midsize  30.0      22       rear          4   3640
## 5  midsize  15.7      22      front          6   2880
## 6    large  20.8      19      front          6   3470
## 7    large  23.7      16       rear          6   4105
## 8  midsize  26.3      19      front          5   3495
## 9    large  34.7      16      front          6   3620
## 10 midsize  40.1      16      front          5   3935
## 11 midsize  15.9      21      front          6   3195
## 12   large  18.8      17       rear          6   3910
## 13   large  18.4      20      front          6   3515
## 14   large  29.5      20      front          6   3570
## 15   small   9.2      29      front          5   2270
## 16   small  11.3      23      front          5   2670
## 17 midsize  15.6      21      front          6   3080
## 18   small  12.2      29      front          5   2295
## 19   large  19.3      20      front          6   3490
## 20   small   7.4      31      front          4   1845
## 21   small  10.1      23      front          5   2530
## 22 midsize  20.2      21      front          5   3325
## 23   large  20.9      18       rear          6   3950
## 24   small   8.4      46      front          4   1695
## 25   small  12.1      42      front          4   2350
## 26   small   8.0      29      front          5   2345
## 27   small  10.0      22      front          5   2620
## 28 midsize  13.9      20      front          5   2885
## 29 midsize  47.9      17       rear          5   4000
## 30 midsize  28.0      18      front          5   3510
## 31 midsize  35.2      18       rear          4   3515
## 32 midsize  34.3      17      front          6   3695
## 33   large  36.1      18       rear          6   4055
## 34   small   8.3      29      front          4   2325
## 35   small  11.6      28      front          5   2440
## 36 midsize  61.9      19       rear          5   3525
## 37 midsize  14.9      19       rear          5   3610
## 38   small  10.3      29      front          5   2295
## 39 midsize  26.1      18      front          5   3730
## 40   small  11.8      29      front          5   2545
## 41 midsize  21.5      21      front          5   3200
## 42 midsize  16.3      23      front          5   2890
## 43   large  20.7      19      front          6   3470
## 44   small   9.0      31      front          4   2350
## 45 midsize  18.5      19      front          5   3450
## 46   large  24.4      19      front          6   3495
## 47   small  11.1      28      front          5   2495
## 48   small   8.4      33        4WD          4   2045
## 49   small  10.9      25        4WD          5   2490
## 50   small   8.6      39      front          4   1965
## 51   small   9.8      32      front          5   2055
## 52 midsize  18.2      22      front          5   3030
## 53   small   9.1      25      front          4   2240
## 54 midsize  26.7      20      front          5   3245

To import a file from your computer, make sure the file is saved in the same folder as your .Rmd file, then read it in with read.csv():

NYCairbnb <- read.csv("NYCairbnb2019.csv")
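
If you want to try read.csv() without the course data, here is a self-contained sketch. The tiny file, its columns, and the name listings are all made up for illustration; the code writes the file to a temporary location first so that it runs anywhere.

```r
# Create a small, made-up .csv file in a temporary location
tmp <- tempfile(fileext = ".csv")
writeLines(c("name,price", "Studio,120", "Loft,300"), tmp)

# Read it in the same way as a file saved next to your .Rmd
listings <- read.csv(tmp)

head(listings)   # first few rows
str(listings)    # column names and types
```

Looking at head() and str() right after importing is a quick way to confirm that the columns were read the way you expected.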

Basics for Working with a Dataframe

Assessing Size:

dim(NYCairbnb)
## [1] 48895    16

Names:

names(NYCairbnb)
##  [1] "id"                             "name"                          
##  [3] "host_id"                        "host_name"                     
##  [5] "neighbourhood_group"            "neighbourhood"                 
##  [7] "latitude"                       "longitude"                     
##  [9] "room_type"                      "price"                         
## [11] "minimum_nights"                 "number_of_reviews"             
## [13] "last_review"                    "reviews_per_month"             
## [15] "calculated_host_listings_count" "availability_365"

Referencing values:

You can reference a particular row and/or column of a dataset by using dataset[row,column]. For example, if I wanted to know the value in the 1st row, 3rd column in the NYCairbnb dataset, I would use the command

NYCairbnb[1,3]
## [1] 2787

Referencing Columns:

NYCairbnb$price
NYCairbnb[,"price"]
NYCairbnb[,10]


attach(NYCairbnb)
price
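
The same ideas on a tiny made-up data frame (listings and its values are hypothetical), including with() as one possible alternative to attach() when you want to use column names directly:

```r
# A small, made-up data frame for illustration
listings <- data.frame(
  name  = c("Studio", "Loft", "Room"),
  price = c(120, 300, 80)
)

listings[2, ]                       # the entire second row
listings[, "price"]                 # the price column, as a vector
listings[["price"]]                 # the same column, another way
listings[1:2, c("name", "price")]   # rows 1-2, selected columns

# Use column names directly for one expression, without attach()
with(listings, mean(price))
```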

Calculations:

mean(NYCairbnb$price)
## [1] 152.7207
sd(NYCairbnb$price)
## [1] 240.1542
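
One thing to watch for: mean() and sd() return NA if the column contains any missing values (the reviews_per_month column in this dataset has some). A small sketch with made-up numbers:

```r
# Made-up values with one missing entry
reviews <- c(0.03, 0.02, NA, 0.14)

mean(reviews)                 # NA, because one value is missing
mean(reviews, na.rm = TRUE)   # mean of the three non-missing values
sum(is.na(reviews))           # how many values are missing: 1
```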

Conditional Subsetting:

#prints out all the rows where the price per night is at least $8000
NYCairbnb[NYCairbnb$price >=8000,]
##             id                                               name  host_id
## 4378   2953058                                      Film Location  1177497
## 6531   4737930                                 Spanish Harlem Apt  1235070
## 9152   7003697                Furnished room in Astoria apartment 20582832
## 12343  9528920                Quiet, Clean, Lit @ LES & Chinatown  3906464
## 17693 13894339    Luxury 1 bedroom apt. -stunning Manhattan views  5143901
## 29239 22436899                                1-BR Lincoln Center 72390391
## 30269 23377410  Beautiful/Spacious 1 bed luxury flat-TriBeCa/Soho 18128455
## 40434 31340283 2br - The Heart of NYC: Manhattans Lower East Side  4382127
##       host_name neighbourhood_group   neighbourhood latitude longitude
## 4378    Jessica            Brooklyn    Clinton Hill 40.69137 -73.96723
## 6531      Olson           Manhattan     East Harlem 40.79264 -73.93898
## 9152   Kathrine              Queens         Astoria 40.76810 -73.91651
## 12343       Amy           Manhattan Lower East Side 40.71355 -73.98507
## 17693      Erin            Brooklyn      Greenpoint 40.73260 -73.95739
## 29239    Jelena           Manhattan Upper West Side 40.77213 -73.98665
## 30269       Rum           Manhattan         Tribeca 40.72197 -74.00633
## 40434      Matt           Manhattan Lower East Side 40.71980 -73.98566
##             room_type price minimum_nights number_of_reviews last_review
## 4378  Entire home/apt  8000              1                 1  2016-09-15
## 6531  Entire home/apt  9999              5                 1  2015-01-02
## 9152     Private room 10000            100                 2  2016-02-13
## 12343    Private room  9999             99                 6  2016-01-01
## 17693 Entire home/apt 10000              5                 5  2017-07-27
## 29239 Entire home/apt 10000             30                 0            
## 30269 Entire home/apt  8500             30                 2  2018-09-18
## 40434 Entire home/apt  9999             30                 0            
##       reviews_per_month calculated_host_listings_count availability_365
## 4378               0.03                             11              365
## 6531               0.02                              1                0
## 9152               0.04                              1                0
## 12343              0.14                              1               83
## 17693              0.16                              1                0
## 29239                NA                              1               83
## 30269              0.18                              1              251
## 40434                NA                              1              365
NYCairbnb[NYCairbnb$price >=8000, c("host_name","neighbourhood", "room_type","price")]
##       host_name   neighbourhood       room_type price
## 4378    Jessica    Clinton Hill Entire home/apt  8000
## 6531      Olson     East Harlem Entire home/apt  9999
## 9152   Kathrine         Astoria    Private room 10000
## 12343       Amy Lower East Side    Private room  9999
## 17693      Erin      Greenpoint Entire home/apt 10000
## 29239    Jelena Upper West Side Entire home/apt 10000
## 30269       Rum         Tribeca Entire home/apt  8500
## 40434      Matt Lower East Side Entire home/apt  9999
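
Conditions can also be combined. The sketch below uses a small made-up data frame (not the course data) to show & (and), | (or), and base R's subset() as an alternative way to write the same kind of filter:

```r
# A small, made-up data frame for illustration
listings <- data.frame(
  name      = c("Studio", "Loft", "Room", "Penthouse"),
  room_type = c("Entire home/apt", "Entire home/apt",
                "Private room", "Entire home/apt"),
  price     = c(120, 300, 80, 9000)
)

# Both conditions must hold (&)
listings[listings$price >= 100 & listings$room_type == "Entire home/apt", ]

# Either condition may hold (|), keeping only two columns
listings[listings$price < 100 | listings$price >= 8000, c("name", "price")]

# Count the matching rows instead of printing them
nrow(listings[listings$price >= 100, ])   # 3

# subset() lets you refer to columns by name
subset(listings, price >= 8000, select = c(name, price))
```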

Best Practices

Commenting

  • Be sure to comment your code (in R, put a # before each line of comment)
  • The more descriptive you can be, the easier it will be for others to read (and for you to read later)

Naming

When naming variables, observations, data frames, or files, make them:

  • meaningful
  • consistent
  • concise
  • code and coder friendly

Other naming considerations:

  • avoid names that are already used as function names (e.g. filter or mean)
  • consider making object names nouns, and function names verbs
  • it’s not the end of the world if you give something a bad name, but it will save you (and others) time and effort down the road
  • avoid formatting and symbols (e.g. spaces or &)
  • keep a clear record of your variable names as well as longer descriptions including units (e.g. surface_temp = surface temperature measurement on Mars in degrees Celsius)
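
To make the naming advice a bit more concrete, here is a small sketch. All names and values are made up for illustration.

```r
# Less helpful: 't' is also the name of a base R function (matrix transpose),
# and the name says nothing about what is stored or in what units.
t <- c(-63, -60, -58)

# More helpful: a noun that describes the contents, with units in the name
# surface_temp_c = surface temperature on Mars, degrees Celsius (made-up values)
surface_temp_c <- c(-63, -60, -58)

# Function names as verbs
convert_c_to_f <- function(temp_c) {
  temp_c * 9 / 5 + 32
}

surface_temp_f <- convert_c_to_f(surface_temp_c)
```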

Entering Things

Some suggestions for best practices:

  • be consistent (e.g. purple vs. Purple vs. purple_)
  • put any additional information such as units or notes in a column separate from the value
  • if there are missing entries, enter the same thing for each missing value (it is common to use NA, NaN, -9999, -); don’t leave cells blank
  • if data is abbreviated, make a record somewhere of what the abbreviations mean
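
Consistent missing-value codes pay off when you import the data. As a sketch (the data and column names here are made up), read.csv() has an na.strings argument that tells R which entries should be treated as missing:

```r
# Made-up data where -9999 and NA were both used for missing values
raw <- "site,surface_temp\nA,-63\nB,-9999\nC,NA\n"

# Tell read.csv() which codes mean "missing"
temps <- read.csv(text = raw, na.strings = c("NA", "-9999"))

is.na(temps$surface_temp)                # FALSE TRUE TRUE
mean(temps$surface_temp, na.rm = TRUE)   # -63, the only non-missing value
```

If -9999 had been left as an ordinary number, it would have been included in the mean.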

Example

by @alisonhorst

Bad data entry, by @alisonhorst

Good data entry, by @alisonhorst